ABSTRACT

A Probabilistic Context Free Grammar (PCFG) for Urdu is developed from a Context Free Grammar (CFG)
for sentences and phrases. Probabilities are assigned to the rules, with the addition of two new
terms: special weights and special probability. Special weights are assigned to rules after
performing certain calculations. Furthermore, if a rule has zero frequency at present but is
expected to be used in future, a small value (0.0001 in our case) is assigned to it instead of zero
probability. All such rules are added to the Urdu PCFG like the other rules. An Urdu PCFG is thus
obtained.

Keywords: weights, efficiency, Perso-Arabic script, morphological system, parsing


Introduction 


Statistical parsers are gaining popularity because of their accuracy and efficiency. A number of
statistical parsers have already been developed (Collins, 1999; Charniak, 2000; Petrov et al.,
2006). The main idea behind any statistical parser is to assign probabilities to the grammatical
rules. In practice, however, the probability that a parse tree is the correct parse of a sentence
depends not only on the rules that are applied but also on the words that appear at the leaves
of the tree (Lakeland & Knott, 2004).

Two main factors make Urdu a challenging language: first, its Perso-Arabic script, and second, its
morphological system, which has inherent grammatical forms and vocabulary from different languages
such as Arabic, Persian and the native languages of South Asia (Humayoun, Hammarström & Ranta,
2007).

Research on Urdu has been carried out from different points of view, such as creating an Urdu
corpus (Samin, Nisar & Sehrai, 2006; Becker & Riaz, 2002) and tagging the Urdu corpus (Anwar, Wang,
Luli and Wang, 2007). Researchers have proposed different tag sets for Urdu, with the number of
tags ranging from 10 (Schmidt, 1999) to 350 (Hardie, 2003). One of the demanding areas for research
in Urdu (from the computational linguistics point of view) is now parsing a corpus. To the best of
our knowledge, a considerable amount of work has not yet been done in this area. An efficient and
accurate parser is needed to parse an Urdu corpus.

Because probabilistic parsers are efficient and accurate, a probabilistic parser is needed to parse
Urdu sentences. A Context Free Grammar (CFG) is a prerequisite for developing such a parser, and a
probabilistic parser additionally requires a probabilistic context free grammar (PCFG). A PCFG is a
CFG with probabilities assigned to the grammar rules, which can better accommodate the ambiguity
and the need for robustness in real-world applications (Tu & Honavar, 2008). A treebank-based
probabilistic grammar for Urdu has been developed (Abbas, Karamat & Niazi, 2009). In this work, a
PCFG is developed for Urdu from tagged Urdu sentences taken from the sources mentioned below. The
development of the PCFG is discussed in detail, and the development steps are divided into
sections. The rest of the paper is organized as follows:

In section 2, text tagging is discussed. In section 3, the steps taken for developing a context
free grammar are presented. In section 4, the development of the PCFG (including the potential
problems) is described. In section 5, the conclusion and future work are given.


Text tagging 


Some tagged text has been made available by the Center for Research in Urdu Language Processing
(CRULP) under the Urdu-Nepali-English Parallel Corpus project, but most of the sentences in this
text are complex from a parsing point of view. Therefore, apart from taking some complex sentences
from the CRULP tagged corpus (www.crulp.org), some more data was collected. Specifically, the focus
was on Urdu aqwaal-e-zareen, mazameen and mini-kahanian written by famous authors such as Saadat
Hassan Minto, Ibne Inshah and Pitras Bukhari. The text was POS tagged using the annotator provided
by CRULP. A tagset with 46 tags was used for text tagging. This tagset was developed recently by
CRULP as part of a project for developing the Urdu-Nepali-English parallel corpus
(http://www.crulp.org/software/ling_resources/UrduNepaliEnglishParallelCorpus.htm). It
follows the Penn Treebank guidelines. Its tags include grades of pronouns (PR, PRP$, PRRF, PRRFP$
and PRRL), demonstratives (DM and DMRL), several tags for verbs (VB, VBI, VBL, VBLI and VBT), tags
for auxiliaries showing aspect (AUXA) and tense (AUXT), an NN tag for both singular and plural
nouns, several other grades of common nouns (NNC, NNCR), two shades of proper nouns (NNP, NNPC) and
a tag WALA, which is used for every occurrence (and inflection) of the word wala (Muaz, Ali and
Hussain, 2009).


Context free grammar for Urdu 


After 200 sentences had been tested, the majority of the rules started repeating themselves, but
for maximum accuracy an additional 100 sentences were taken, and they contributed a few further
rules. The process was stopped when additional input text and parsing time became infeasible,
either delivering no further rules or incurring a very high time overhead. As examples from real
text were used for the construction of rules, both basic word order and free word order were
encountered. The developed grammar contains rules for both of these word orders. A sample of the
tested sentences, their corresponding rules and the output of the Chart parser are shown below:


[Figure: POS-tagged sample sentences, the rules derived from them (for example NP->NNP NNPC,
S->NP VP, VP->VB VP and VP->VB VBT) and the corresponding Chart parser trace]

Figure 1. Output of the Chart parser


Problems faced in CFG and proposed solutions


The parser did not accept the $ sign in tags (e.g. PRRFP$), so it was replaced by '_l' wherever it
appeared. Out of 300 sentences, 270 were parsed successfully and 30 failed to parse, a success rate
of 90%. The rules for the 270 parsed sentences were collected and normalized to form a CFG. The 30
failed sentences were analyzed carefully, and the difficulties were mostly of the following nature:


1. In long and complex sentences that involved a number of rules, the parser failed when some of
the rules were of the type

   NP->NN NP
   NP->NN VP
   NP->NN PP

These rules share the prefix NP->NN but differ in their final term (resulting in ambiguity), so it
is difficult for the parser to decide which of the three rules to take next.

2. The parser usually failed due to left recursion in the case of rules such as S->VP VP.

The 30 failed sentences and their rules were also stored in a file for future analysis (to find out
whether any other parser is able to handle those sentences).


Probabilistic context free grammar for Urdu 


A PCFG is a CFG in which each production is assigned a probability. This probability is assigned to
each rule by the simple formula:
P = Number of occurrences of a rule / Total number of occurrences of all rules with the same
left-hand side.

To calculate the probabilities, the successive rules (those used in the construction of the
sentences) were arranged so that the rules for each of NP, VP, PP and S were kept in separate
files. The total frequencies for NP, VP, PP and S were 2136, 914, 179 and 2091 respectively. These
files were then used to get the frequency of each individual rule and ultimately its probability by
the formula above. For example, if the frequency of the rule S->VP NP is 23 and the total frequency
of S rules is 2091, then the probability of this rule is 23/2091 ≈ 0.0109 (truncated to four
decimal places).
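
The relative-frequency calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name `rule_probabilities`, the tuple representation of rules and the placeholder rule `S->OTHER` are assumptions made for the example, and truncation to four decimal places is inferred from the value 0.0109 quoted in the text rather than stated by it.

```python
from collections import Counter
from math import floor


def rule_probabilities(rule_counts, places=4):
    """Estimate P(rule) = count(rule) / count(all rules with the same left-hand side)."""
    lhs_totals = Counter()
    for (lhs, _rhs), count in rule_counts.items():
        lhs_totals[lhs] += count
    scale = 10 ** places
    # Truncate (not round) to the requested number of decimal places.
    return {
        rule: floor(count * scale / lhs_totals[rule[0]]) / scale
        for rule, count in rule_counts.items()
    }


# S->VP NP occurs 23 times out of 2091 S rules (figures from the text).
# The remaining 2068 occurrences are lumped into one placeholder rule.
counts = {("S", ("VP", "NP")): 23, ("S", ("OTHER",)): 2068}
probs = rule_probabilities(counts)
print(probs[("S", ("VP", "NP"))])
```

Keeping one counter per left-hand side mirrors the separate files kept for NP, VP, PP and S.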

A section of the table showing the calculations of probabilities of NP is given below: 


[Table: NP rules with the columns FREQUENCY and PROBABILITY]

Figure 2. Probability calculations for NP


Problems faced in PCFG and proposed solutions


Some problems were noticed during the development of the PCFG, and solutions were provided for
them. These solutions take the form of special weights and special probability.

a. A rule such as VP->VB VBT (where VB and VBT are the tags for "Verb" and "Verb to be"
respectively) was replaced by the rule VP->VB VP because of the existence of the rule VP->VBT. This
rule (i.e. VP->VBT) was not used directly in the 300 sentences that were selected from the sources
mentioned above, so according to the probability calculations its probability would be zero.
However, the rule may be used directly in other sentences when software (e.g. an Urdu Probabilistic
Parser) is applied to large amounts of natural text. A rule with zero probability contributes
nothing to the further calculations performed (even though it is used in the process of
probabilistic parsing). These calculations are:

1. Finding the total probability of all the successive rules.
2. Selecting the parse whose total probability is the maximum as the most suitable parse of the
sentence by the Urdu Probabilistic Parser.
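
A minimal Python sketch of these two calculations follows. It assumes that the "total probability" of a parse is the product of the probabilities of the rules it uses, which is the usual PCFG convention but is not spelled out in the text. The function names are hypothetical, and the NP and VP probabilities are invented for illustration; only the two S values come from the Urdu PCFG.

```python
from math import prod


def parse_probability(rules_used, pcfg):
    """Step 1: multiply the probabilities of the rules applied in one parse."""
    return prod(pcfg[rule] for rule in rules_used)


def best_parse(candidates, pcfg):
    """Step 2: return the candidate parse (a list of rules) with the highest probability."""
    return max(candidates, key=lambda rules: parse_probability(rules, pcfg))


pcfg = {
    "S->NP VP": 0.9637,  # value reported for the Urdu PCFG
    "S->VP NP": 0.0109,  # value reported for the Urdu PCFG
    "NP->NN": 0.5,       # hypothetical value
    "VP->VB": 0.4,       # hypothetical value
    "VP->VBT": 0.0,      # unseen rule left at zero probability
}
parse_a = ["S->NP VP", "NP->NN", "VP->VB"]
parse_b = ["S->NP VP", "NP->NN", "VP->VBT"]

print(parse_probability(parse_b, pcfg))     # a single zero factor wipes out the whole parse
print(best_parse([parse_a, parse_b], pcfg))
```

Under this assumption, any parse that uses a zero-probability rule can never be selected, which is the problem the special weights are meant to avoid.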

All of these calculations depend on the probabilities of the individual rules. To avoid the
problem of a rule with zero probability, special weights were assigned to all such rules, with a
few limitations. Special weights can be assigned to a rule subject to the following conditions:

1. The rule has a single term on its right hand side (e.g. in the rule NP->JJ, JJ is the single
term on the right hand side).

2. The single term on the right hand side of the rule is not a phrase (e.g. NP, VP and PP) or a
sentence (S).

3. The only term on the right hand side of the rule occurs as the second term in at least one of
the remaining rules.
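
The three conditions above can be checked mechanically. The sketch below is one possible reading, not the authors' implementation: the function name, the tuple representation of rules and the two-rule toy grammar are assumptions. It also assumes that condition 3 looks at the second term of any other rule in the grammar, since the text does not say whether only rules with the same left-hand side count.

```python
PHRASE_LABELS = {"S", "NP", "VP", "PP"}


def eligible_for_special_weight(rule, grammar):
    """Return True if a zero-frequency rule (lhs, rhs) satisfies all three conditions."""
    _lhs, rhs = rule
    if len(rhs) != 1:              # condition 1: exactly one right-hand-side term
        return False
    term = rhs[0]
    if term in PHRASE_LABELS:      # condition 2: the term is not a phrase or a sentence
        return False
    # condition 3: the term is the second term of at least one other rule
    return any(
        len(other_rhs) >= 2 and other_rhs[1] == term
        for other_lhs, other_rhs in grammar
        if (other_lhs, other_rhs) != rule
    )


# Toy grammar built from rules mentioned in the text.
grammar = [("NP", ("NN", "RB")), ("S", ("NP", "VP"))]

print(eligible_for_special_weight(("NP", ("RB",)), grammar))       # RB is second in NP->NN RB
print(eligible_for_special_weight(("NP", ("NP", "NP")), grammar))  # fails condition 1
print(eligible_for_special_weight(("S", ("VP",)), grammar))        # fails condition 2
print(eligible_for_special_weight(("NP", ("PR",)), grammar))       # fails condition 3
```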

The special weights were calculated separately by the following procedure:

For NP, the occurrences of JJ (adjective), Q (quantifier), RB (adverb), NNPC (proper noun continue)
and NNCR (combined noun continue) in the second position on the right hand side of a rule were
first counted separately. For example, in the rule NP->NN RB the part of speech in the second
position is RB. The number of times RB appears in this position (in all successive rules) was
counted. After their occurrences (frequencies) had been counted, the tags were assigned special
weights. The weights were calculated by two methods:


1. Dividing the number of occurrences of a particular tag (for example RB) by the total number of
occurrences of all such tags (JJ, Q, RB, NNPC and NNCR) in the second position.
2. Dividing the number of occurrences of a particular tag by the total number of rules for NP,
which is 2136.

The same procedure was adopted for VP and PP.

A section of the table showing the calculations of special weights is given below:


(Col 1 = number of occurrences in the second position of other rules)

Percentage    Weights based on Col 1/245    Special weights (Col 1/914)
6.94          0.0694                        0.0186
4.89          0.0489                        0.0131
37.96         0.3796                        0.1334

Figure 3. Assignment of special weights


The weights calculated by the second method (i.e. the last column in the above table) are smaller
than those calculated by the first method. The weights from the second method are used in the PCFG,
as the aim is to use the smallest possible value (for accuracy) instead of assigning zero
probability. These weighted rules can be used in parsing exactly like all the other rules in the
probabilistic context free grammar that are assigned probabilities.
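
The two weighting methods can be compared with a short Python sketch. The tag counts below are invented for illustration and are not the paper's data; only the NP total of 2136 comes from the text, and the function name is an assumption.

```python
def special_weights(second_position_counts, total_rule_frequency):
    """Compute both candidate weights for each tag seen in the second position.

    Method 1 divides by the total of all second-position counts.
    Method 2 divides by the total frequency of rules for the phrase type.
    """
    total_second = sum(second_position_counts.values())
    method_1 = {tag: c / total_second for tag, c in second_position_counts.items()}
    method_2 = {tag: c / total_rule_frequency for tag, c in second_position_counts.items()}
    return method_1, method_2


# Hypothetical second-position counts for the NP tags named in the text.
np_counts = {"JJ": 40, "Q": 10, "RB": 5, "NNPC": 30, "NNCR": 15}
method_1, method_2 = special_weights(np_counts, total_rule_frequency=2136)

for tag in np_counts:
    print(tag, method_1[tag], method_2[tag])
```

Because the second-position counts are a subset of all rule occurrences, the denominator in method 2 is never smaller than in method 1, so method 2 always yields the smaller (or an equal) weight.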

b. There were a few rules (e.g. S->VP, S->VP PP) that were expected to be used in future but did
not appear in the sentences that were tested through the chart parser. Their frequency was zero,
resulting in a zero probability for these rules. The procedure used for assigning special weights
could not be applied to them, for the following reasons:

In 9 rules (out of 12 such rules in total), there were two terms on the right hand side of the rule
(e.g. NP->NP NP), which violates condition 1 mentioned in item a. In one rule (i.e. S->VP), the
term on the right hand side was a VP, which violates condition 2. The right hand sides of the
remaining two rules (i.e. NP->PR and VP->VBJ) did not occur as a second term in any other rule,
which violates condition 3. It was therefore not possible to assign special weights to these twelve
rules.

To avoid this problem (i.e. to include these 12 rules in the PCFG with non-zero probability), all
twelve rules were assigned a small probability of 0.0001. As probability values are taken to four
decimal places in the Urdu PCFG, this is the smallest possible non-zero value. Any new rule
required in future can be added to the PCFG with this special probability (with a very small effect
on accuracy).
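
Assigning the special probability amounts to giving every unseen rule a fixed small value. The following Python sketch shows the idea; the function and constant names are hypothetical, and the grammar is reduced to two S rules with their reported probabilities for brevity.

```python
SPECIAL_PROBABILITY = 0.0001  # smallest non-zero value with four decimal places


def add_special_probability(pcfg, unseen_rules, value=SPECIAL_PROBABILITY):
    """Return a copy of the grammar in which each unseen rule gets the fixed small probability."""
    updated = dict(pcfg)
    for rule in unseen_rules:
        # Only touch rules that are missing or currently at zero.
        if updated.get(rule, 0.0) == 0.0:
            updated[rule] = value
    return updated


pcfg = {"S->NP VP": 0.9637, "S->VP NP": 0.0109}
updated = add_special_probability(pcfg, ["S->VP", "S->VP PP"])
print(updated)
```

Returning a copy leaves the frequency-based grammar untouched, so the rules that received the special probability remain easy to identify.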

A section of the table showing special probabilities, assigned to rules, is given below: 


Table 1. Assignment of special probabilities

S. No    RULE        PROBABILITY
1        S->VP       0.0001
2        S->VP PP    0.0001
3        S->S S      0.0001





The benefits of assigning special weights and special probability to a few rules are:

1. Rules that may be required in future are added to the Urdu PCFG with non-zero probability.

2. Special weights help to reduce the number of rules in the Urdu PCFG. Suppose the rule VP->VBL
has a special weight, and the rules VP->JJ VP, VP->RB VP and VP->ITRP VP are already in the Urdu
PCFG. If three new rules (VP->JJ VBL, VP->RB VBL and VP->ITRP VBL) are encountered, then instead of
being added to the Urdu PCFG they are merged into the existing rules by using the rule VP->VBL:
VP->JJ VBL is merged into VP->JJ VP, VP->RB VBL into VP->RB VP, and VP->ITRP VBL into VP->ITRP VP.
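
The merging step in the second benefit can be sketched as a simple rewrite. This is an illustrative reading of the description, with a hypothetical function name and tuple representation: a two-term rule whose second term has its own unary rule for the same phrase is rewritten so that the second term becomes the phrase itself.

```python
def merge_with_unary_rule(rule, unary_rules):
    """Rewrite A -> X t as A -> X A when the unary rule A -> t exists; otherwise keep the rule."""
    lhs, rhs = rule
    if len(rhs) == 2 and (lhs, (rhs[1],)) in unary_rules:
        return (lhs, (rhs[0], lhs))
    return rule


unary_rules = {("VP", ("VBL",))}  # VP->VBL carries a special weight
new_rules = [
    ("VP", ("JJ", "VBL")),
    ("VP", ("RB", "VBL")),
    ("VP", ("ITRP", "VBL")),
]
merged = {merge_with_unary_rule(rule, unary_rules) for rule in new_rules}
print(sorted(merged))
```

All three new rules map onto rules that are already in the grammar, so the rule count does not grow.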

A section of the table showing the Urdu PCFG is given below:

Table 2. Urdu PCFG

S. No    Rules       Probabilities
1        S->NP VP    0.9637
2        S->PP VP    0.0167
3        S->PP NP    0.0019
4        S->VP NP    0.0109





The Urdu PCFG is developed with 127 rules, including 11 rules with special weights and 12 rules
with special probabilities assigned.


Conclusion and future work 


A Chart parser is used to accept POS tagged text and display the rules that are used in the process
of parsing. First a CFG and subsequently a PCFG for Urdu are developed. This PCFG can be used by a
probabilistic parser for Urdu (yet to be developed) that accepts POS tagged text as input and
generates the structure of that text.